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Abstract. Product between mode-n unfolding Y( n ) of an N-D tensor J) and Khatri-Rao products of (N - 1 ) 
factor matrices A (m) ,m = 1, . . . ,n - l,n + 1, ... ,N exists in algorithms for CANDECOMP/PARAFAC (CP). If J/ is 
an error tensor of a tensor approximation, this product is the gradient of a cost function with respect to factors, and 
has the largest workload in most CP algorithms. In this paper, a fast method to compute this product is proposed. 
Experimental verification shows that the fast CP gradient can accelerate the CP.ALS algorithm 2 times and 8 times 
faster for factorizations of 3-D and 4-D tensors, and the speed-up ratios can be 20-30 times for higher dimensional 
tensors. 
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1. Introduction. Canonical polyadic decomposition also coined CANDECOMP/PARAFAC 
(CP) [816] is a common tensor factorization which has found applications such as in chemo- 
metrics, telecommunication, analysis of fMPJ data, time-varying EEG spectrum, data min- 
ing CHEl, classification, clustering |29|, stochastic PDEs [15]. For example, CP was applied 
to analyze the auditory tones by Carroll and Chang [8], or to vowel-sound data by Harsh- 
man [16], or to model fluorescence excitation-emission data by hidden loading components 
in chemometrics [3|. Applications of CP to sensor array processing and CDMA systems in 
telecommunications have been considered in 01211301 . In neuroscience, Field and Graupe [ 14] 
extracted topographic components model from event-related potentials data, M0rup et al. 11231 
analyzed EEG data in the time-frequency domain. 

Since the alternating least squares (ALS) algorithm was proposed ll8l [T6ll . there have 
been intensive research efforts to improve performance and accelerate convergence rate of CP 
algorithms. A number of particular techniques are developed such as line search extrapolation 
methods (HQS EH US, compression [13, or simply adding a small diagonal matrix [10|. 
Instead of alternating estimation, all-at-once algorithms such as the OPT algorithm [1], the 
PMF3, damped Gauss-Newton (dGN) algorithms 1241331 and fast dGN 1I25I26I32I are studied 
to deal with problems of a slow convergence of the ALS in some cases. Another approach is 
to consider the CP decomposition as a joint diagonalization problem [11 ,21.. 22, 28]|28) . 

CP algorithms can speed-up convergence rate, or cope with difficult problems. However, 
in all CP algorithms, the largest workload is product of tensor unfoldings and all-but-one 
factors which has not been inadequately considered. If a tensor of size I\ X 1% X • • • X In is 
an error tensor of a data tensor and its approximation, the products express the gradients of 
a cost function with respect to factors of size /„ x R. Hereinafter, we call this product "CP 
gradient". The CP gradients with respect to all the factors have a high computational cost of 



order O 



nr\\i„ 



In addition, mode-n tensor unfoldings with n = 2,3,..., N - 1 are also 



time consuming due to accessing non-contiguous blocks of data entries and shuffling their 
orders stored in memory. For high dimensional data tensors such as /V > 4, the CP gradients 
may become very computational demanding. Experimental results show that it might take 
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several hours to factorize 7-D tensors if they comprise hundreds of millions or billions of 
entries (e.g., a tensor of size 10 x 10 x 10 x 10 x 10 x 10 x 10) and has high rank R = 10. 
In this paper, a fast computation method is proposed for the CP gradients. The method 



avoids mode-n tensor unfoldings, and reduces the computational cost to O 



N \ 



for com- 



puting CP gradients over all modes. 

The paper is organized as follows. Notation and basic multilinear algebra are briefly re- 
viewed in Section|2] CP model and CP gradients are shortly reviewed in this section. The fast 
computation method is presented in Section|3] The fast implementation of the ALS algorithm 
utilizing the fast CP gradient is introduced in Section |4] In Section we provide examples 
illustrating the validity and performance of the proposed algorithm. Section [6] concludes the 
paper. 

2. Notation and CANDECOMP/PARAFAC (CP) model. We shall denote a tensor 
by bold calligraphic letters, e.g., 3i e R /,X/2X ' " xl " , matrices by bold capital letters, e.g. A 
=\_a\, d2, ■ ■ ■ , a«] € R /xS , and vectors by bold italic letters, e.g. a ; or/ = [h,h, ...,/#]. 

An i = ,12,..., ;'iv]-th entry yi = }f(i\ , z'2, , . . , jV) with 1 < i n < /„, n = 1, 2, . . . , N is 
alternatively denoted by y, with the index i — ivec(i, /£| defined as 

N n-l 

i = ivec(i, /) = ii+Yj 0'» - !) ]1 *J- (2A) 

n=2 7=1 

A vector of integer numbers is denoted by colon notation such as k = v. j = [i, 2+1 , . . . , j— 
l,j]- For example, we denote 1 :n = [ 1 , 2, . . . , n] . 

Generally, we adopt notation used in [9 20 1. The Kronecker product, the Khatri-Rao 
(column-wise Kronecker) product, and the (element-wise) Hadamard product and division 
are denoted respectively by ®, O, ©, Il9ll20ll. 

Notation 2.1. (Hadamard and Khatri-Rao products of matrices) Given a set of N 
matrices A^"' 6 R 7 " xS , n — 1,2, ... ,N, Hadamard and Khatri-Rao products among them are 
denoted by 

®A (k) = A (A ' ) ©---©A ( " +1, ©A ( "- 1) ©---©A (1) , /„=/,Vn, 

ktn 
N 

A w = A w o . . . oA w q . . . q A m Vn? 

«=i 

A W = A W Q...Q A (» + D A (»-D . . . Q A (D. 



Definition 2.1 (Reshaping). The reshape operator for a tensor }f e R / i x/ 2 x '" x/ » to a size 
specified by a vector L — [L\,L,2, . . ., Lm] with Yl m= i L m — Yl„=\ In returns an M-D tensor 
X, such that vec(J/) = vec(<Y), and is expressed as 

X = reshaped, L) e R L ' xL ^ x '■' xL «. (2.2) 

Reshape does not change the order of entries in its vectorization. 

Definition 2.2 (Tensor unfolding |4|). Unfolding a tensor if e R 7 ' X/2X ' " X ' N along modes 
r — \r\,ri, . . . , r^] and c — [c\,C2, ... ,c^-m] where [r,c] is a permutation of [1,2, .. .,A^] 



ivec is the "sub2ind" Matlab function. 
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M N-M 

aims to rearrange this tensor to be a matrix Y rxc of size J~~[ I ri X J~~[ I Cl whose entries 

k=l /=1 

(juji) are given by Y rxc (j l ,j 2 ) = JJ(i r ,h), where i r = [i ri ...i r J, i c = [i Cl . . . i Cfl _ M ], 
f = ivec(i r , / r ), h = ivec(i' c ,/ c ). 
Remark 2.1. 

1. If c = \c\ < c% < ■ ■ ■ < c^-m], then Y rxc is simplified to Y( r ). 

2. If r — n and c — [1 , . . . , n — \,n + I, . . .,N], we have mode-n matricization Y rxc — 
Y ( „). 

3 y - Y T 

■ J - 1 rxc ~ *- cxr 

4. For r — [1,2, ... ,n], c = [n+ 1, n + 2, . . , , N], Vn, Y rxc = Y( r ) can be expressed and 
efficiently performed by reshape, that is 

n N 

Y w = reshaped, [J n ,K n ]), J n = ]~[ I k , K„ = ]~[ I k . (2.3) 

k=\ k=n+\ 

Definition 2.3. (mode-n tensor-vector product) The mode-n multiplication of a tensor 
\f e jg/ix/zx-x/jv fry a vec tor a e R 7 " returns an (N - \)-D tensor X defined as 

vec(Z) = Yf„ )fl . (2.4) 

Symbolically, the product is denoted by 

Z = y x„ a € R 7 ' x '- x7 "-' x7 " +1 x '" x7w . (2.5) 

Tensor-vector product of a tensor if with a set of N column vectors {a} = {a (1) , a (2) , . . . , 
a (iV) } is denoted by 

J/x{a} = 2/xia (1) x 2 a (2) ---x A ra (iV) . (2.6) 



Definition 2.4. (CANDECOMP/PARAFAC (CP)) Factorize a given N-th order data 
tensorif e R 7 i x ^x-x4 into a set of N component matrices (factors): A {n) = [a^ ) ,a^ ,) , . . .,a^ n) ] 
6 R 7 " xfl , (« = 1,2, ... ,N) representing the common (loading) factors [8. 16. 17 J, that is, 

j/^«;. 1) o«; 2 )o...o«w = #, (2.7) 

r=l 

where symbol "o" denotes outer product. Tensor if is an approximation of the data tensor 
}). Mode-n matricization of J/ can be represented as: 

\A:#« / 

2.1. Complexity of Tensor Unfoldings. Tensor unfoldings are to rearrange entries of 
tensors to be matrices. We note that entries of the tensor J/ are stored as a long vector vec(J/) 
of the size n^li 4 m memory. From this view point, tensor unfolding is to change the order 
to entries in its vectorization. The more the changes of entries take place, the slower the 
unfolding are. Moreover, reading data (entries) stored in non-contiguous blocks will be at a 
slower rate than accessing data stored in a contiguous block. 
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The mode-1 unfolding Y (1) comprises J_\ = 72/3 ■ ■ ■ Jn column vectors which consist of 
I\ contiguous entries of if 



yi y/,+i 



ycj_i-i)/i+i 



yi, yih ■■■ yj N 

[ m:h) W\ + l:2/i) 



y((7_i - + I Jn) ]. 



(2.8) 



(2.9) 



By taking into account that Yqv) = Yj ijS |) , we in practice compute Y(i : ;v-i) instead of 
Y(jv). Y(i-ivr-i) consists of Iff vectors each of which comprises Jff-\ contiguous entries given 
by 



Y,: 



[m-JN-i) lf(Jjt-i + l:2J N -i) ■ ■ ■ Mm - Wn-i + I-Jn)] .(2.10) 



In general, unfoldings Y(i„) do not change the order of entries of H 

Y ( i;„) - [m-Jn) y{h + h2J„) ■ ■ ■ J/((K„ - l)/„ + 1:/*)] , n = \,2,...,N. (2.11) 



Hence, they are relatively fast. We denote by X , m = [i„+i,i n 



+2, 



, j'iv], n-dimensional 



subtensors of if whose each entry is given by X {i\ , h, . . . , i„) = }f{i\ , i%, . . . , i„, i n+ 



An) 



The mode-n unfolding Y(„) of the size /„ x /_„, /_„ = I\ ■ ■ ■ 7„_i I n+ \ ■ ■ ■ In = J n -\ K„, can be 
expressed as concatenation of K„ mode-n unfoldings of X (m> 



Y («) - [ X (i? 



Am) 



x (m) = 

(n) 



) » 

y»ij„+i 

ymJ„+J„-i + l 
ymJ„-J„-, + l 



ymJ„+J n _i 

)W„+2y„_i+i 



Xgf ], M=[I n+ iJ n+ 2,...J N \, 
m = ivec(m, M) - 1. 



y(m+l)J„ 



Therefore, most entries of Y ( „) have changed their orders. This is why the mode-n unfoldings 
Y(„) for 1 < n < N are more time consuming, and relatively slower than unfoldings Y(i :n ). 



2.2. Gradients in CP Algorithms. We consider the cost function 

D = \\y-y\&, 



(2.12) 



and the gradients of this cost function with respect to the factor A (n) , n = 1, 2, . . . , N are given 
by lEUEa 

G«=Efa) (Oa<*>Uy w (Oa<*>) -A (n >(®A (k)T A ( A eR / " xS ,n=l,...,^2.13) 

\k*n ) \k*n ) \k±n ) 

where E(„) denotes the mode-n unfolding of the error tensor £ = if — if . The product 
E(„) (©A*) or Y(„) 0A (t) has a computational cost of order O(RJn), and is the most 



\k±n 



expensive step in CP algorithms. Indeed, the mode-n unfoldings Y(„) for n > 1 are time- 
consuming, but are not appropriately computed. The latter product Y(„) 



A w is more 



5 



Algorithm 1: Direct Computation of Y (n) A w ) (5]|6) 

Input: if: (h x I 2 x ■ ■ ■ x I N ), N matrices A (n) e R'" xR 

Output: G (n) = Y w (Ot*,A*>) : I n xR 

begin 

1 if <— permute(J/, [n, l:n — l,n+ 1:AT]) % tensor transposition 

2 Y ( „) <— reshape(J/, [/„, /_„]) % tensor unfolding Y (n) 

3 Lgm=y ( „)(GUa«) 



efficient than E(„j ( © A (i) ) in the sense of computation because it does not need to construct 
the error tensor £. However, since both products involve the same mathematical expression, 
we also call Yr„i 



A ( ' the CP gradient in which if is considered as an error tensor. 

\ten j 

The CP gradients are employed in almost all CP algorithms. For example, the alternating 
least squares (ALS) algorithm [2 8 16 30.31 1 alternatively minimizes the cost function (12.12b 
with an update rule given by 



A W «_ Y(n) (© A«) (© A« r A.A ' 



(n = 1,2, 



,A0, 



(2.14) 



where "f " denotes the pseudo-inverse. A fast implementation of ALS for 3-way tensor 031 
reduces the expensive computation of Y(„) (Qk^n A^. Unfortunately, this algorithm cannot 
be generalized to higher orders ll34l . The all-at-once algorithms such as OPT [ 1 1, PMF3, the 
damped Gauss-Newton (dGN) algorithms [24-26, 32, 33 1 compute gradients in their update 
rules 



r\g, V>0, 



(2.15) 



(2.16) 



where a = jvec(A (1) ) 7 • • • vec(A (n) ) 7 
sian and g is the gradient defined as 



g = 



3D 

da 



3D 



dvec(AW) 



vec(A w ) 7 



3D 
3vec(AW) 



H denotes the (approximate) Hes- 



dD \ T 
3vec(AW)j 



(2.17) 
also 



For a nonnegative tensor factorization, the well-known multiplicative algorithm 
involves the CP gradients 

A« <- AW®|Y( n) (OA®J)0|A w (®A« r A w )) , (n = 1,2,. ..,iV). (2.18) 

The direct computation of the product Y(„) (©^„ A®) for single mode is illustrated in 
AlgorithmQ] and is implemented in the mttkrp function of the Matlab Tensor toolbox 016]. 
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3. Fast Computation of CP Gradient. 

3.1. Order of Dimensions. The CP gradient G (n) given by 

GW = Y W ( O A«) = [y w ® af\ Y w «f , . . . , Y w <8> «f 

\k£n J k+n ktn ktn 

involves R products 



eR 4xR (3.1) 



><„_! x„ +1 ■ ■ • x w «™ . r = 1, . ..RQ.2) 



,(«+!) 



(,V) 



The Kronecker products t = can be efficiently computed by the following scheme 

011 



t <- a (2) ® a (1) , f<-a (3) ®f, 
f <- a ( " +1) ® f, f <- a ( " +2) ® f, 



(3.3) 



with a computational cost of 0(Yll=2 h+ — Tlk=n+\ ■'*)• 

Assuming that 7# < /jv-i, we transpose J/ following = [1 : N — 2, N, N — 1] to obtain 
tensor J/ <p> of the size 7i X • ■ • X I n-2 xIn x 7/v_i . The tensor- vector products in (13. 2\ can be 
expressed by 



m n k l\ {««}xf =n+1 {««} = ^ <p> xZl{a^x^ +1 {a^}x N _ ia f> x N af~» 

I N-2 \ ln-\ \\ 

® a® ® <8>a« 



(«) 



(3.4) 



According to the above computation scheme in (13.31 1. the Kronecker products in (13.4-b require 
a computational cost of 



O 



'n-\ j / AT-2 \\ /n-1 \ N ^ 

2 7 * + T Jk + Jn - 2 In + Jn <0 ^Zj Jk + J ^Zj Jk 

a=2 " \k=n+l )) \k=2 " k=n+l 



k=n+l ) 



(3.5) 



by noting that Jn-\ = J n-2 In-\- As a result, in order to efficiently compute G ( "\ we need to 
permute the tensor if such that I[ < h < ■ ■ ■ < In- Hereinafter, we implicitly assume that the 
data tensor has been rearranged in the ascending order of its dimensions. 

From ( 13.3b . computation of G^ in ( 13. U requires a number of multiplications of 



/ H-l 



N \ 



K k=2 



k=n+l ) 



(3.6) 



In a particular case when 7„ = 7, Vn, Algorithm Q] executes a number of multiplications of 

M Alg ^ri) = RJ? k=2 lK 

3.2. Fast Gradient with Respect to A Specific Factor. The direct computation of 
q(«) _ [g-W] i n (|3.l[ ) involves the tensor unfolding Y(„) which is relatively slow to obtain 
for 1 < n < N, due to accessing non-contiguous blocks of entries. We note that vectors g r 
can be expressed in an equivalent form consisting of tensor-vector products J/ x" k z\ and 
y x^ n+I a? on the right side and left side of n, that is 



(3.7) 



7 



g("> = (j/xjl„ +1 ««) x£a« 



(3.8) 



We show in the sequel, that the former way, (13. 7K is less computationally demanding for 
< ^T n -i> an d the latter way, d3.8t is less demanding in the opposite case, /„ > K„-\ . 
Note that the inner tensor-vector products in (13.7) and (13.8b can be efficiently computed 
through Y(i ; „_i) and Y(i :n ) as 



J?w = y x n k z\ af } = reshape 



B-l 



YL-n0«®, Un,...JNl 



k=\ 



N 

n (r,n) A y fl ( t) = reshape (g) ^ 

k=n+l 



(3.9) 
(3.10) 



It means that the reshapings of }f to Y ( „) with 1 < n < N are avoided. 

Let us discuss complexity of the two ways of computing the gradients g" separately. 

3.2.1. Thecase/,, < K„-\. The computation proceeds first by computing the n-dimensional 
tensors « (r ' M) defined in dXTOl l for all r = 1, . . . ,R. This operation requires the number of mul- 
tiplications of 



M KU = R 



N \ 



Jn + — ^ h 



k=n+2 I 



(3.11) 



The second step consists in computing as 



„(«) _ R (n«) 



k=\ 



(3.12) 



where R^'"' is the mode-n unfolding of l R { - r,n) . The second step has the complexity 



-(«) 



n-l \ 



k=2 J 



The total number of multiplications is 



(3.13) 



M RL (n) = R 



= R 



n-l ^ 



" ifc=n+2 t=2 / V " *=n+2 jt=2 , 



n-l \ 



" fc=n+l 



(3.14) 



which is less than that of Algorithm Q] due to J„ > I„ and /„ < I n+ \. That means the right-to- 
left projections should be faster than AlgorithmQ] 

3.2.2. The case J„ > K n -\. Here we build up (N - n + 1)-D tensors £}- r ' n) of size /„ x 
7„+i x • • • x In defined in (13.9b . This step requires the number of multiplications of 



n-l \ 



M LRl = R 



Jn + 2^ J k 



(3.15) 



k=2 J 
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The second step consists in computing the product 



g r = L 



(r.n) 
(1) 



® af 



k=n+\ 



(3.16) 



where L^f is mode-1 unfolding of jj-^. The number of multiplications in (13.16b is given 
by 



Mua=R\K„ 



1 N ^ 

T Z" 



(3.17) 



From ( 13.15b and ( 13.17b , the proposed algorithm requires a total number of multiplications of 



M LR (n) = R 



n-1 j N \ / n-1 j JV 

/jV + ^ /* + ^n-1 + Y Tj Jk <R J N + Y^ Jk + J " + T Yj Jk 

k=2 " k=n+2 J V k=2 " k=n+2 J 



M RL (n) < M Alg ^n), 



1 < n < N, 



(3.18) 



which is less than Mrl in ( 13.14b in the previous subsection and of Algorithm[T]due to K„-\ < 
J n and J n > I„. That means the left-to-right projections should be faster than AlgorithmQ] 

3.3. Fast CP gradient From Adjacent Ones. CP algorithms available in the literature 
compute all G (n) n = 1,2, ...,N either sequentially (in alternating algorithms ll2ll8~| [T6][T8l 
|23]|2)]|33]) or simultaneously (as in all-at-once algorithms ffl l24H26ll3"2]|33"l , line-search fl27] 
|33l ). This section will present a fast method to compute the gradients recursively for all 
n = l,...,N. 

Note that 



ft(r,n) = y ^ fl (*) = ^ fl ( 



(b+1) 



(3.19) 



or 

vec(# r - n) ) = R££ (3.20) 

Similarly, 

J<r,n) = a (k) = £ (,y,-l) Sj fl («-l, i (3 21) 

or 

vec(X (nn) ) = L[;f )r a*"- 1 '. (3.22) 

By exploiting relations in ( 13.20b and (13.221 i. we can quickly derive « (n) from 7? (n+1) or £ (n) 
from .£ instead of fully computing them as in ( 13.10b and ( 13.9b . respectively. The total 
number of multiplications of the algorithm is summarized in Table l4.ll and is lower than that 
of Algorithm!]] 

The proposed algorithm to compute CP gradients over all modes is summarized as Al- 
gorithmic] Gradient G ( " ' or G ( " - 1 is first computed where n* = maxjn; J„ < K„}. Gradients 
G (,l) for n = n* — 1 , n* — 2, . . . , 1, and G™ for n = n* + l,n* + 2, ... ,N are then sequentially 
computed. Note that in addition to the lower number of multiplications, Algorithm [2] avoids 
unfolding Y(„) (1 < n < N) which is time consuming. Therefore, the higher the dimension 
of the tensor is, the more significant the computational saving of Algorithm [2] with respect to 
Algorithm[T]is. 
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Algorithm 2: Fast Computation of Y (n) 



A w over all modes 

\k*n ) 

Input: If . (h X I 2 X • ■ • X JV factors A (n) e R 7 " xS 
Output: G (n) = Y w I O A (k) ) : /„ X /?, n = 1, 2, . . . , N 

\k*n J 

begin 

n* = max{«; /„ < K„] where J n = I\h... /„, =/„+!•■• 
for n = n*,n* - 1, . . ., l,n* + +2, . . .,iVdo 
if (n == n*) then 



9 

10 
11 

12 
13 
14 
15 

16 



R 



: reshape(»/, [J„,K n ]) 



O A (t) 



mode-;7 unfolding of f(' 



G (n) = cp_gradient(7? (n) ,{A}, 'right') 
else if (n == n* + 1) then 



- («) 
"(i) 



(n-\ \ 
k=\ 



reshape(J/, [/„_!, ^n-i]) % raode-l unfolding of £ { " } 



G (n) = cp.gradient(X (B) , {A}, 'left') 
else if (n< n*) then 

forr = 1,2, ...,Rdo % Compute lt m) as in (TOel) 

[ vec(« (nn) ) <- reshape(« (r '" +l) ,[y„,/„ +l ])a^ +l) 

G (n) = cp_gradient(ft (,!) ,{A}, 'right') 
else 

forr = 1,2, ...,R do % Compute X^" 1 as in (FOB 

[ vec(x (r ' n) ) «- reshaped"- 1 *, [/„_!, ^^D^a*"-^ 

G (n) = cp_gradient(£ (n) , {A}, 'left') 



function G (n) = cp_gradient(Z (n) , {A}, side) 
17 for r = 1 , 2, . . . , R do 
switch s/afe 



18 
19 

20 



case 'right': gf> = reshape^-"*, [/„-i,/„]) r (<8C! 
case 'fe/f: g< n) = reshapeCZ^, [/„,*„]) (®L + i 



% in J3TT2I ) 

% Z (r ' n ) = -C" '-"' 
% in J37161) 



« (n) (i!, . . . , i n , r) = <R (r ' n \i u . . . , i n ) in Stepg] 
£ {n \r, i n , ...,i N ) = Z (r ' n \i„, i N ) in Stepg] 



4. Fast ALS Algorithm. This section presents a fast implementation of the CP ALS 
algorithm J2. 14-b in which gradients are computed using Algorithm [2] That is, the fast ALS 
algorithm is proposed to first update A ( " ' or A ( "* +1) instead of A (1) . The algorithm then 
updates sequentially A (n) for n = n* — 1, n* — 2, . . , , 1, and A ( ' !) for n = n k + 1, n* + 2, . , , , N. 
The alternating update rules (12.141 are inserted in the "for" loop in Algorithm [2] and are 
executed after computing gradients G (n) . Such strategy requires a computational cost of order 
0(RJn + NR 3 ) to complete updating all A (n) . Other alternative algorithms [9. 18. 23 | can be 
accelerated in a similar way. 
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Table 4. 1 

Comparison of the number of multiplications executed in methods to compute Y(„) Oa®. 

\k*„ I 



Number of multiplications 



Unfoldings - Order of Entries 



M , 



' n—1 y i\ 

rjjn) = R T\jk+Y X Jk + Jh 

i i~i Yl j . i 



\k=2 '" *=n+l 

(n-\ 



V J k + min(y„, K n -i) + -j- V J k + J, 



R Jk, n < n* , 



n-l + Kn-l + 2_j ^ k 



k=n+2 

n = n* \n* + 1, 



n > n* + 1 



Y ( „) 

Y(i;„*) or Y(i : „*_i), 



change 



R^f'andR^'or 



T M 

(1) 



■(H) 



no-change 



5. Simulations. In order to verify the fast CP gradients (Algorithm |2), we compared 
the fast CP_ALS algorithm in Section [4] with the ordinary CP_ALS algorithm |(8][T6l which 
was implemented in the Matlab Tensor toolbox [6 | and used the direct computation of CP 
gradients in Algorithm[TJ(mttkrp of the Tensor toolbox (ver. 2.4) [5. 6|). The Matlab codes 
of the fast CP gradient and fast CP_ALS are available at the following link: http://www. 
bsp.brain.riken.jp/~phan/fastCPgradient.rar. Random tensors with various di- 
mensions N = 3, 4, 5, 6 were randomly generated with different sizes I n = I = 10, 20, . . . , Vn. 
Both algorithms factorized the same data tensors into various rank R = 1,10, 20, . . . , / using 
the same initialization values and in 20 iterations. There was not any stopping criterion for 
both algorithms. Execution time for each algorithm was measured using the stopwatch com- 
mand: "tic" "toe" of MATLAB release 201 la on a computing server which has 1.8 GHz i7 
processor and 4 GB memory. The Tucker compression was not used in the simulations. 

Speed ratio is defined as the ratio between execution times per iterations of CP_ALS and 
the fast CP_ALS 

Execution Jime als 

(5.1) 



Execution JimehstALS 



The final results were averaged over at least 200 iterations. Fig. 15. ll illustrates speed-up ratio 
per iteration (times) for factorization of 3-D and 4-D tensors with different sizes / and ranks 
R, whereas the speed-up ratios per iterations for 5-D and 6-D tensors are given in Table |5T| 
In an extra example for decomposition of 7-D tensors with I„ = 10, V«, and R = 10, the 
ordinary ALS algorithm took an average 1.844 seconds per iteration, while the fast ALS took 
only 0.044 second per iteration, and achieved an average speed-up ratio of 42.2 times. For 
some data which consists of collinear factors, such as bottleneck or swamps [10], the ALS 
algorithm could execute thousands of iterations. Hence, the ALS algorithm [8 16] needs at 
least 1 hour to run over 2000 iterations. Whereas the fast ALS algorithm executes the same 
number of iterations only in 1 or 2 minutes. This indicates a huge potential benefit of our fast 
algorithm. 



6. Conclusions. CP gradients are always the largest workload of order O 



N \ 

nrY]i, 



most CP algorithms such as the ALS and all-at-one algorithms. Moreover, the computation 
can be time consuming due to unfoldings Y(„) with 1 < n < N. The fast computation method 
has been proposed to avoid Y(„) for 1 < n < N, and has an approximate computational cost of 
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Table 5.1 

Comparison execution times ( seconds ) between the CP ALS algorithm and the fast CP ALS using 
the fast CP gradients for random data tensors of size I\ = h = . . . = l N = I. Results for each 
combination (N, I, R) consist of execution times and speed-up ratio between two algorithms as indicated 
in the below minitab. The results were averaged over at least 200 runs on a computer which has 1.8 
GHz i7 processor and 4 GB memory. 





1 




10 




20 




30 


40 


5,20 


9 10- 3 


33.1 


0.02 


19.3 


0.02 


19.6 


3 10~ 4 


9 10~ 4 


1.2 10^3 


5,30 


0.05 


62.1 


0.09 


21.9 


0.1 


21.2 


0.2 


17.3 


9 10~ 4 


4.3 10-3 


6.2 10-3 


0.011 


5,40 


0.3 


83.1 


0.35 


21.6 


0.46 


20.1 


0.63 


17.6 


1.4 


3 10~ 3 


0.016 


0.023 


0.036 


0.046 31 6 


6, 20 


0.2 


102.9 


0.34 


24.2 


0.5 


34.7 


2 10-3 


0.01 


0.015 



Execution -time als (seconds) 
Execution-time f ast als (seconds) 









= N - 3. 1 - 50 
-V- N - 3. 1 = 100 








N - 3. 1 = 150 








-o- N = 3. 1 - 200 












b 








\\ 
































o. 














— 


>o o Q 













100 

R 

(a) 3-D tensors 











a N - 4, I - 30 
-V- N 4, 1 - 60 
-> N - 4, 1 = 90 
-o- N = 4, 1 = 1 20 
























































,0 
















,-o''' 




r. ■ Xr ""' 













































R 

(b) 4-D tensors 



Fig. 5.1. Speed-up ratios per iteration (in logarithmic scale) between the ordinary CP -ALS algorithm and its 
fast implementation using the fast CP gradients for factorization of 3-D and 4-D tensors with various sizes /„ = /, Vn, 
and ranks R. 



order O 



R 



n 



. Especially, the fast CP gradients can be used to accelerate any CP and NTF 



algorithms. Experimental results show that our algorithm can be about 8 times faster than the 
direct computation for 4-D tensors, and it can be up to 20-30 times for higher dimensional 
tensors. 
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